Spotify is one of the larger music streaming services available today, with 345 million active users 1. Instead of requiring you to buy CDs or download every song you listen to, Spotify allows access to millions of songs. Because Spotify is a centralized platform, it provides us with an opportunity to ask questions about what makes songs successful.
To investigate what makes some songs more successful than others, we will look at whether certain features are strongly correlated with other features. In addition, we want to discover the most popular genre. We expect that our data will point to a few genres that may be the most popular, and that certain features will be strongly correlated with other features.
The data we are using is based on Spotify data from 1921 to 2020 and includes over 175,000 audio tracks. We found our data on Kaggle 2. This dataset groups the data by artist, genre, and year. There are eleven different variables measured in the dataset: acousticness, danceability, duration, energy, liveness, instrumentalness, loudness, speechiness, valence, popularity, and tempo.
Energy (en) is a perceptual measure of the intensity and activity of a track on a scale from 0.0 to 1.0. The perceptual features that contribute to it include dynamic range, perceived loudness, timbre, onset rate, and general entropy. Liveness (li) ranges from 0.0 to 1.0 and detects whether an audience is present in a recording. If the liveness value is above 0.8, there is a strong likelihood that the track is live. Acousticness (ac) is a confidence measure of whether the track is acoustic. It varies from 0.0 to 1.0, with 1.0 representing high confidence that the track is acoustic. Loudness (lo) ranges from -60 to 0 and is measured in decibels (dB). It gives the overall loudness averaged over the entire track. Danceability (db) combines tempo, rhythm stability, beat strength, and regularity. It rates how suitable a track is for dancing from 0.0 to 1.0, with 1.0 being the most danceable. Duration (dur) measures the length of the track in milliseconds (ms). The instrumentalness (ins) feature tracks whether a song contains vocals. “Ooh” and “aah” sounds are treated as instrumental in this context, while rap or spoken word tracks are considered vocal. Instrumentalness ranges from 0.0 to 1.0, with 1.0 being the most instrumental. Speechiness (sp) is the opposite of instrumentalness, measuring the relative length of the track containing any kind of human voice. The tempo (tmp) feature gives the tempo of the track in beats per minute (BPM). Valence (val) measures the positiveness of the track; higher valence corresponds to more cheerful and upbeat songs. Lastly, popularity (pop) is calculated by an algorithm that is based on the total number of plays the track has had and how recent those plays are.
In the rest of our report, we will first group the genres into broader categories and then analyze the features across the genres. We will also compare features to each other and test the correlation between each pair of features to see whether they have a strong linear relationship. Lastly, we will try to discover which genres are the most popular by using t-tests to compare the genre popularity means. In the end, we will show how popularity is related to different genres, as well as how different features relate to each other.
There were 3232 genres. We condensed these into the 20 most frequently occurring terms in the genre names by splitting the names with regular expressions and counting the occurrences of each term. We did this because, as you can see below, there were initially far too many genres due to their specificity. To do meaningful analysis, we condensed these genres.
Here are all the original genres. As you can see there are thousands of them, and most of them are obscure. Some seem to not even make sense (like the genre “[]”).
These are the top 100 terms found across all the genres. These terms will be used to create the simplified genres. Note that some of the original genres are double counted; for example, the original genre african rock falls under both of the simplified genres african and rock. From these counts we concluded that we would use the top 20 of these simplified genres.
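The term counting described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the code used for the report; the function name `count_genre_terms` and the sample genre names are assumptions made for the example.

```python
import re
from collections import Counter

def count_genre_terms(genres):
    """Count how many genre names contain each whitespace-separated term."""
    counts = Counter()
    for name in genres:
        # Split on runs of whitespace; a set keeps a term from being
        # counted twice within the same genre name.
        terms = set(re.split(r"\s+", name.strip().lower()))
        terms.discard("")  # an empty name would otherwise yield ""
        counts.update(terms)
    return counts

# Hypothetical genre names, for illustration only.
genres = ["african rock", "rock", "swedish pop", "pop rock", "hip hop"]
counts = count_genre_terms(genres)

# The report keeps the top 20 terms; this toy example keeps the top 2.
top_terms = [term for term, _ in counts.most_common(2)]
```

Note that "african rock" contributes to both `african` and `rock`, which is the double counting mentioned above.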
These are the top 20 terms found. They are the ones we will use. Note that this only uses 60.7% of the data, since 39.3% of the data do not fall under these top 20 categories. We concluded that, for our purposes, the top 20 are sufficient as simplified genres for analysis. We chose the top 20 because beyond that point there seemed to be diminishing returns in adding more genres. For example, the 20th genre, techno, had a count of 41, which is not much more than the 25th and 50th genres, alternative and dance, which have counts of 36 and 21 respectively. The genre with the largest count was pop, with a count of 257. With this in mind, we decided there wasn’t much value in including these smaller genres and considered only the top 20 genres.
We use these top 20 to create a more concisely labeled dataset. We also include the label other to account for the other 1743 genres with their occurrences less than or equal to 39.
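One way the relabeling could look is sketched below. The helper `simplify_genre` and the three-term list are hypothetical stand-ins (the report uses 20 terms), and the sketch assumes a genre is assigned every top term it contains.

```python
def simplify_genre(name, top_terms):
    """Return the simplified labels for one genre name, or ["other"]."""
    terms = name.lower().split()
    # A genre gets every top term it contains, so it may be double counted.
    labels = [term for term in top_terms if term in terms]
    return labels or ["other"]

# Hypothetical shortened term list, for illustration only.
top_terms = ["pop", "rock", "rap"]

single = simplify_genre("african rock", top_terms)
double = simplify_genre("pop rock", top_terms)
fallback = simplify_genre("[]", top_terms)
```

Here `single` holds one label, `double` holds two, and `fallback` lands in the other category because none of its terms are in the list.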
This graph shows the number of occurrences of each simplified genre.
This graph shows the number of occurrences of each simplified genre without the “other” category (so the scale is slightly easier to read).
The second question we want to answer is whether any features have strong linear correlations with other features. To do this, we use r-values and their corresponding graphs.
First, we computed the r-value for every pair of features, as shown in the table below.
The raw r-values between every combination of features.
The r-values in an easier-to-view format. Red and blue signify higher correlations.
As the table above shows, some features seem to have strong linear relationships, while some features seem to not have a strong linear relationship. To isolate those, we filtered for the absolute value of r-values only over .9 to find the strongest feature relations. We chose .9 as a threshold arbitrarily since there were many features that correlated. You can see some other thresholds below.
key:
ac = acousticness, db = danceability, dur = duration_ms, en = energy, ins = instrumentalness, li = liveness, lo = loudness, sp = speechiness, tmp = tempo, val = valence, pop = popularity
0.9: There are 3 r-values above this threshold.
0.8: There are 10 r-values above this threshold.
0.7: There are 11 r-values above this threshold.
0.6: There are 18 r-values above this threshold.
none: There are 55 r-values when no threshold is applied.
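The thresholding step can be illustrated with a short sketch. It is written in Python with hand-rolled helpers (`pearson_r` and `strong_pairs` are names chosen for this example) and made-up feature values, so it shows the idea rather than the report's actual computation.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strong_pairs(features, threshold):
    """Return {(name_a, name_b): r} for pairs with |r| above the threshold."""
    result = {}
    for a, b in combinations(sorted(features), 2):
        r = pearson_r(features[a], features[b])
        if abs(r) > threshold:
            result[(a, b)] = r
    return result

# Made-up values for three features, for illustration only.
features = {
    "en": [0.2, 0.4, 0.6, 0.8],
    "ac": [0.9, 0.7, 0.4, 0.2],
    "li": [0.3, 0.1, 0.4, 0.2],
}
pairs = strong_pairs(features, 0.9)
```

With 11 features there are `math.comb(11, 2) = 55` distinct pairs, which matches the count reported when no threshold is applied.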
In the below graph and table we can again see that there are strong correlations between energy and the other features acousticness, loudness, and tempo. However, it is also interesting to note that those same three features that we found correlate strongly with energy also correlate with each other, although to a lesser degree.
It is hard to say exactly what this means, but it does suggest a few possibilities and speaks to the difference between correlation and causation. For example, there is an r-value of -0.8715355 between tempo and loudness. However, since we know that both of those features correlate even more strongly with energy, it may be that their relation to energy is what is more significant. This shows that these features are all highly related, and the fact that they all also correlate highly with each other suggests that they measure something similar.
Plotted feature vs feature.
The raw r-values.
As mentioned before, we chose the above threshold of 0.9 somewhat arbitrarily. This was partially motivated by not wanting to make section 3.2.3 too complicated since the number of comparisons between features grows quadratically.
However, for completeness, we will show the relationships for the top correlated features above a 0.8 threshold, since there seem to be some interesting relationships here as well. Comparing each of the features, ten r-values are above a threshold of 0.8. Acousticness, energy, instrumentalness, and loudness each had a correlation over 0.8 with the popularity feature. Acousticness, energy, and loudness each passed the threshold with tempo. Finally, loudness had a strong correlation with acousticness and energy, and acousticness had a high r-value with energy. It is interesting that the same features keep appearing in these high correlations.
Below are the (seven) feature pairs above that were not included in the 0.9 section.
A negative relationship with r-value -0.8463968.
A negative relationship with r-value -0.8715355.
A negative relationship with r-value -0.8787419.
A positive relationship with r-value 0.8851824.
A negative relationship with r-value -0.8074837.
A positive relationship with r-value 0.8097074.
A positive relationship with r-value 0.8820383.
These are all of the feature pairs whose correlation passes the threshold of 0.8, similar to what was explored in section 3.2.3. Since there are many more features here, the plot is somewhat noisier (this is why we chose a threshold of 0.9 previously). For this reason we will not do analysis similar to what was done in 3.2.3, but we include these graphics and r-values for completeness.
Table of the r-values.
We created density plots to get an initial idea of how each genre relates to each feature. These are messy, so in the next section we will try to make sense of them. In particular, we will analyze the popularity feature as it relates to each genre.
We cannot run a t-test on a population that isn’t normal. So to check for normality, we plot the populations that we are testing. Of these, popularity is the most clearly normal so we will do most of our analysis with popularity. However, some other graphs seem possibly normal, such as danceability and liveness.
We ran t-tests to find differences between genres in the features we considered normal. The t-test statistic 3 is as follows:
\[ t = \frac{m_a - m_b}{\sqrt{\frac{s_a^2}{n_a}+\frac{s_b^2}{n_b}}} \]
We use this test statistic to calculate the p-value by finding the corresponding quantile from the Student's t-distribution with \(\max(n_a, n_b)-1\) degrees of freedom. While we will focus on analyzing popularity in particular in the next section, we run this test for danceability and liveness as well. The results of these tests are shown below.
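As a rough illustration of the statistic and degrees of freedom defined above, the sketch below computes \(t\) and a p-value using only the Python standard library. Two points are assumptions of this example and are not stated in the report: the p-value is taken to be two-sided, and the t-distribution tail is approximated by Simpson's rule on the density instead of a library CDF. The input numbers are invented.

```python
import math

def t_statistic(m_a, s_a, n_a, m_b, s_b, n_b):
    """t = (m_a - m_b) / sqrt(s_a^2 / n_a + s_b^2 / n_b)."""
    return (m_a - m_b) / math.sqrt(s_a ** 2 / n_a + s_b ** 2 / n_b)

def t_pdf(x, df):
    """Density of the Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value (assumption), via Simpson's rule on [0, |t|]."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # approximates P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central)

# Invented summary statistics for two genres, for illustration only.
n_a, n_b = 25, 25
t = t_statistic(10.0, 2.0, n_a, 8.0, 2.0, n_b)
df = max(n_a, n_b) - 1  # degrees of freedom as defined in the text
p = two_sided_p(t, df)
```

A pair of genres would then be flagged as significantly different when `p < 0.05`.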
All t-tests between genres for danceability.
Filtered for only significant differences (p-value < 0.05).
All t-tests between genres for liveness.
Filtered for only significant differences (p-value < 0.05).
All t-tests between genres for popularity.
Filtered for only significant differences (p-value < 0.05).
We decided to further examine the popularity of the genres in depth to see if we could discover the most popular genre(s). We focus on analyzing the feature popularity because we are the most confident it is normal, and also because the results make sense contextually.
We made a graph comparing box plots of popularity for all the genres to get an overview of the distributions before moving on to our t-tests. Overall, the box plots suggest that rap has the highest mean popularity of all the genres. Next, we will use t-tests to evaluate whether the differences in the genre means are large enough for us to conclude that rap has the highest mean popularity.
We made a density plot of our estimator, the mean popularity, to check that it looks normally distributed, so that we know whether using a t-test is appropriate. As you can see, the plot seems normally distributed, so a t-test is appropriate.
We then ran the t-test for the feature popularity between all genres (same as 3.3.3).
Next, using the rap column, we isolated the genres whose p-value was not below 0.05, to find the genres that cannot be distinguished from rap in mean popularity.
To see if this makes sense, we show the mean and standard deviation of popularity from these genres. As you can see, their means were very similar, so it makes sense that their p-values were not significant.
To put these results back into context, we come back to the original boxplots we had. Now, we can draw a line that divides what we can consider the most popular genres from the rest. In this way we can conclude that the simplified genres rap, hip, and swedish are the most popular genres.
To determine a single most popular genre, we would need to observe a significant difference between the top genre's mean popularity and that of every other genre. From our t-tests on the genre popularity means, we were not able to come to a definite conclusion on which genre is the most popular. Because not all of the p-values are below .05, we do not have statistical evidence to conclude that the rap mean is significantly higher than that of every other genre. Hip and swedish have p-values above .05, so either of them could still be the most popular. However, for the genres with p-values below .05, we do have statistically significant evidence that they are not the most popular. This information could be useful for someone trying to create a song because it indicates which genres are most popular with listeners. A song in a more popular genre may be more likely to be listened to, which can generate more revenue for the artist.
We also discovered which features have the strongest linear correlations with each other, and which features have no linear relationship. We found that energy has a correlation with absolute value over 0.9 with three other features: acousticness, loudness, and tempo. Acousticness has a negative correlation with energy, while loudness and tempo both have positive correlations with energy. Considering that acousticness, loudness, and tempo are all based on set measurements, while energy is calculated from the intensity and activity in the song, it is plausible that acousticness, loudness, and tempo all affect the energy of a song, although correlation alone cannot establish this. This finding matters because when a producer is trying to make different characteristics of a song come together, the correlations between specific features may help them adjust those features so that they complement each other better.
A shortcoming of our analysis is that we do not know how many songs are included in the data for each genre. Some genres' data may be based on more songs than others. In addition, because we kept only the 20 most frequent terms when grouping genres, some of the genres are not included in our analysis.
Another shortcoming was the way in which we created our simplified genres. We used regular expressions that split the original genre names by whitespace, but this resulted in multiple issues. One was the double counting of genres, such as african rap being counted as both african and rap. Another was that some genres, like hip-hop, were not always written with a hyphen in the dataset, so our top 20 genres actually include two genres called hip and hop that likely correspond to just one genre (hip-hop). One possible fix might have been to preprocess these special cases, for example by replacing the space in "hip hop" with a hyphen so that it becomes "hip-hop".
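The preprocessing idea for special cases might look something like the following sketch. It was not part of our analysis; the `SPECIAL_CASES` table and the `normalize_genre` helper are hypothetical, and only the hip-hop case from the text is handled.

```python
import re

# Hypothetical table of special cases: regex pattern -> canonical spelling.
SPECIAL_CASES = {
    r"\bhip[\s-]+hop\b": "hip-hop",
}

def normalize_genre(name):
    """Lowercase a genre name and rewrite known multi-word terms."""
    name = name.lower().strip()
    for pattern, replacement in SPECIAL_CASES.items():
        name = re.sub(pattern, replacement, name)
    return name

# After normalization, splitting on whitespace keeps hip-hop as one term.
terms = normalize_genre("Swedish Hip Hop").split()
```

Running the whitespace split after this step would yield a single hip-hop term instead of separate hip and hop terms.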
Future work on this dataset could involve testing more of the relationships between features and seeing whether they can be modeled well. We could also look for datasets from other music streaming services, such as Apple Music and Pandora.